PAOLILLO MAURIZIO



Modern multiband digital photometric surveys produce large amounts of data that,
in order to be effectively exploited, require automatic tools capable of extracting
an objective classification from photometric data.

We present here a new method for classifying objects in large multiparametric
photometric databases, consisting of a combination of a clustering algorithm and
a cluster agglomeration tool. The generalization capabilities and the potential
of this approach are tested against the complexity of the Sloan Digital Sky Survey
archive, for which an example application is reported.



1. Introduction 

In the last few years the astronomical community has experienced a tremendous
growth in the size, quality and accessibility of databases. This trend will accelerate
in the coming years due to the advent of large dedicated survey telescopes and to
the implementation of the International Virtual Observatory infrastructure. There
is the practical fact that the extraction of useful information from such datasets
cannot be effectively performed with traditional tools and, from the methodological
point of view, the wealth of information contained in such huge data sets requires
abandoning old conceptual schemes largely based on the 3-D visualization capability
of the human mind and adopting "ad hoc" statistical pattern recognition, classification
and visualization methods. The application of such algorithms to the astronomical
case is anything but trivial, due to the complexity of astronomical data, which usually
present strong nonlinear correlations among parameters and are highly degenerate.
Among these tools, those dealing with the identification and visualization of groups
of objects sharing the same physical properties are especially relevant to the
astronomical case.

2. The methods 

The method outlined here follows a hierarchical approach. A preliminary clustering
is performed with a clustering algorithm, the "Probabilistic Principal Surfaces";
a second phase then uses the Negative Entropy concept and a dendrogram structure
to agglomerate the clusters found in the first phase.

2.1. PPS - Probabilistic Principal Surfaces

Probabilistic Principal Surfaces (PPS) are a nonlinear extension of principal
components, in that each node on the PPS is the average of all data points that project
near/onto it. PPS define a nonlinear, parametric mapping $y(x; W)$ from a
$Q$-dimensional latent space ($x \in \mathbb{R}^Q$) to a $D$-dimensional data space
($t \in \mathbb{R}^D$), where normally $Q \ll D$. The function $y(x; W)$ (defined
continuous and differentiable) maps every point in the latent space to a point in the
data space. Since the latent space is $Q$-dimensional, these points will be confined
to a $Q$-dimensional manifold nonlinearly embedded in the $D$-dimensional data space.
In our method, the points belonging to the parameter space are projected onto the
surface of a 2-dimensional sphere. The visualization capabilities of the PPS can prove
very useful in several aspects of the data interpretation phase such as, for instance,
the localization of data points lying far away from the denser areas (outliers), or of
those lying in the overlapping regions between clusters, or the identification of data
points for which a specific latent variable is responsible.
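The mapping $y(x; W)$ can be illustrated with a minimal sketch. It assumes the common generalized linear form $y(x; W) = W \phi(x)$ with Gaussian basis functions, which the text does not spell out. The Fibonacci-lattice placement of points on the sphere, the basis width and the random weights are illustrative assumptions, not the authors' setup; a real PPS would fit $W$ to the data (typically by an EM procedure), which is omitted here.

```python
import math
import random

def fibonacci_sphere(n):
    """Return n roughly uniform points on the unit sphere (assumed node layout)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n          # z strictly inside (-1, 1)
        r = math.sqrt(1.0 - z * z)             # radius of the circle at height z
        points.append((r * math.cos(golden_angle * k),
                       r * math.sin(golden_angle * k),
                       z))
    return points

def rbf_features(x, centres, width):
    """Gaussian basis functions phi_l(x), one per latent basis centre."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2.0 * width ** 2))
            for c in centres]

def pps_map(x, centres, width, weights):
    """y(x; W) = W phi(x): map one latent point to the D-dimensional data space."""
    phi = rbf_features(x, centres, width)
    return [sum(w * p for w, p in zip(row, phi)) for row in weights]

# Sizes quoted in Section 4: 614 latent variables, 51 latent bases, D = 4 colors.
nodes = fibonacci_sphere(614)
bases = fibonacci_sphere(51)
rng = random.Random(0)
# Untrained random weights (D x number of bases), for illustration only.
W = [[rng.gauss(0.0, 1.0) for _ in bases] for _ in range(4)]
projected = [pps_map(x, bases, 0.5, W) for x in nodes]
```

Each latent node on the sphere is sent to one point of the 4-dimensional data space, so the image of the sphere is a 2-dimensional manifold embedded in that space.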

2.2. NEC - Negentropy Clustering 

Most unsupervised methods require the number of clusters to be provided a priori, a
serious problem when exploring large complex data sets where the number of clusters
can be very high or unpredictable. A simple threshold criterion is not satisfactory in
most astronomical applications, since the high degeneracy and the noisiness of the
data lead to erroneous agglomeration. A different approach is possible, based on the
combination of a similarity criterion built on the concept of Negative Entropy and
the use of a dendrogram as the agglomerative algorithm. We implement Fisher's linear
discriminant, a classification method that first projects high-dimensional data onto
a line and then performs a classification in the projected one-dimensional space.
We also define the differential entropy $H$ of a random vector
$\mathbf{y} = (y_1, \ldots, y_n)^T$ with density $f(\cdot)$ as

$$H(\mathbf{y}) = -\int f(\mathbf{y}) \log f(\mathbf{y}) \, d\mathbf{y},$$

so that the negentropy $J$ can be defined as

$$J(\mathbf{y}) = H(\mathbf{y}_{\mathrm{Gauss}}) - H(\mathbf{y}),$$

where $\mathbf{y}_{\mathrm{Gauss}}$ is a Gaussian random vector with the same covariance
matrix as $\mathbf{y}$. The Negentropy can be interpreted as a measure of
non-Gaussianity and, since it is invariant under invertible linear transformations,
finding an invertible transformation that minimizes the mutual information is roughly
equivalent to finding directions in which the Negentropy is maximized. Our
implementation of the method uses approximations of Negentropy that give a very good
compromise between the properties of the two classic non-Gaussianity measures given by
kurtosis and Negentropy. Negentropy can be used to agglomerate, in an unsupervised way,
the clusters (regions) found by the PPS approach. The only a priori information is a
dissimilarity threshold T. Suppose we have c multi-dimensional regions $X_i$, with
$i = 1, \ldots, c$, that have been defined by the PPS approach. These regions are passed
to the Negentropy Clustering algorithm which, in practice, measures whether or not two
clusters could be modeled by one single Gaussian or, in other words, whether the two
regions can be considered to be aligned or part of a greater data set.
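The merging test described above can be sketched in one dimension, assuming the two clusters have already been projected onto a line (for instance the Fisher direction). The sketch uses the widely used approximation $J(y) \propto (E\{G(y)\} - E\{G(\nu)\})^2$ with $G(u) = \log\cosh u$ and $\nu$ a standard Gaussian variable; whether this is the exact approximation used by the authors is not stated in the text. The function names, the merge rule and the threshold value are hypothetical, and the scale of the approximation is arbitrary, so the threshold is only meaningful relative to this particular estimator.

```python
import math
import random

def log_cosh(u):
    """Numerically stable log(cosh(u))."""
    a = abs(u)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def gaussian_expectation(g, lo=-8.0, hi=8.0, steps=4000):
    """E{g(nu)} for a standard normal nu, computed with the trapezoidal rule."""
    h = (hi - lo) / steps
    norm = 1.0 / math.sqrt(2.0 * math.pi)
    total = 0.0
    for k in range(steps + 1):
        u = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * g(u) * norm * math.exp(-0.5 * u * u)
    return total * h

E_G_GAUSS = gaussian_expectation(log_cosh)

def negentropy_approx(sample):
    """Approximate negentropy of a 1-D sample, up to a constant factor."""
    n = len(sample)
    mean = sum(sample) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in sample) / n)
    # Standardise so that only the shape of the distribution matters.
    e_g = sum(log_cosh((v - mean) / std) for v in sample) / n
    return (e_g - E_G_GAUSS) ** 2

def can_merge(proj_a, proj_b, threshold):
    """Assumed merge rule: the union of the two projected clusters looks Gaussian."""
    return negentropy_approx(proj_a + proj_b) < threshold

# Toy data: two overlapping clusters and one well separated cluster.
rng = random.Random(1)
cluster_a = [rng.gauss(0.0, 1.0) for _ in range(5000)]
cluster_b = [rng.gauss(0.0, 1.0) for _ in range(5000)]
cluster_c = [rng.gauss(10.0, 1.0) for _ in range(5000)]
T = 1e-3  # hypothetical dissimilarity threshold
merge_ab = can_merge(cluster_a, cluster_b, T)
merge_ac = can_merge(cluster_a, cluster_c, T)
```

The union of the two overlapping clusters is close to a single Gaussian and gives a small value, while the union of the two separated clusters is bimodal and gives a larger one, so only the first pair is a candidate for agglomeration.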

3. The data 

All data used in this work are extracted from Data Release 4 of the Sloan
Digital Sky Survey (SDSS). The spectroscopic subsample (SpS) constitutes the "knowledge
base" on which we have founded the labelling of the unsupervised clusters. The SDSS
also provides, for each object in the SpS, a spectroscopic classification index called
specClass. All objects are classified in specClass as either a quasar, high-redshift
quasar (with z > 2.2), galaxy, star, late-type star, or unknown (with specClass values
ranging from 0 to 6) by matching emission lines found in the observed spectrum
against a list of common galaxy and quasar emission lines. We have extracted
from the SDSS DR4 spectroscopic subsample a catalog containing ≈ 600000 objects,
excluding from the query only the objects labelled as 'SKY' according to specClass.
The percentage distribution of the resulting sample with respect to the specClass
index is the following: specClass 0: 1.5%, specClass 1: 8.5%, specClass 2: 78.7%,
specClass 3: 8%, specClass 4: 0.1%, specClass 6: 2.8%. Furthermore, we excluded from
this sample all objects fainter than r = 18, thus obtaining ≈ 43000 records.
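The selection steps above can be sketched as follows. The record layout (dictionaries with keys `specClass`, `u`, `g`, `r`, `i`, `z`), the helper names and the toy magnitudes are hypothetical, and the code 5 for sky spectra is an assumption suggested by its absence from the distribution quoted above, not something the text states. The last helper computes the four colors used as clustering parameters in Section 4.

```python
from collections import Counter

SKY = 5  # assumed specClass code for sky spectra; not stated in the text

def select_sample(records, r_limit=18.0):
    """Drop sky spectra and objects fainter than r_limit (larger magnitude = fainter)."""
    return [rec for rec in records
            if rec["specClass"] != SKY and rec["r"] <= r_limit]

def class_percentages(records):
    """Percentage of objects per specClass value."""
    counts = Counter(rec["specClass"] for rec in records)
    total = sum(counts.values())
    return {cls: 100.0 * n / total for cls, n in sorted(counts.items())}

def colours(rec):
    """The four colors (u-g, g-r, r-i, i-z) from the five magnitudes."""
    return (rec["u"] - rec["g"], rec["g"] - rec["r"],
            rec["r"] - rec["i"], rec["i"] - rec["z"])

# Toy records with made-up magnitudes, for illustration only.
records = [
    {"specClass": 2, "u": 19.1, "g": 17.9, "r": 17.2, "i": 16.8, "z": 16.5},
    {"specClass": 3, "u": 19.0, "g": 18.8, "r": 18.6, "i": 18.5, "z": 18.4},  # too faint
    {"specClass": 5, "u": 18.0, "g": 17.5, "r": 17.0, "i": 16.9, "z": 16.8},  # sky
    {"specClass": 1, "u": 17.5, "g": 16.4, "r": 16.0, "i": 15.9, "z": 15.8},
]
sample = select_sample(records)
distribution = class_percentages(sample)
features = [colours(rec) for rec in sample]
```

Keeping the cut on r separate from the color computation makes it easy to change the magnitude limit without touching the feature definition.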

4. The experiment 

The unsupervised clustering method presented here is based on the combined use
of the PPS and NEC algorithms. We first applied the PPS algorithm to the sample of
spectroscopically selected SDSS DR4 objects, using as parameters for the clustering
the 4 colors obtained from the model magnitudes (u-g, g-r, r-i, i-z) of the SDSS
archive. We fixed the number of latent variables and latent bases of the PPS to 614
and 51 respectively, thus obtaining at the end of this step 614 clusters, each formed
by the objects that respond only to a given latent variable. We chose a large number of
latent variables in order to obtain an accurate separation of objects and to prevent
groups of distinct but nearby points in the parameter space from being projected into
the same cluster by chance. The clusters found by the PPS algorithm are graphically
represented by groups of points with the same color (a different color for each
cluster) on the surface of a 2-d sphere embedded in the 3-dimensional latent space.
These groups of objects are then input to the totally unsupervised NEC agglomeration
algorithm, whose only free parameter is the dissimilarity threshold T, as mentioned
above. We performed a plateau analysis to determine the optimal value of this
threshold: we ran different experiments with T varying over a wide range, then
selected the central value of the intervals of T for which the number of final clusters
is constant. The number of clusters resulting from the NEC aggregation is 31. We
present in Table 1 a collection of the most interesting of these clusters after
labelling each object with its specClass index.
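The plateau analysis can be sketched as below. The input is a list of (T, number of final clusters) pairs from repeated runs; picking the widest plateau when several exist is an assumption of this sketch, since the text only says that the central value of a constant interval is selected. The function name and the toy values of T are hypothetical.

```python
def plateau_threshold(runs):
    """Return (central T, cluster count) of the widest interval of T over which
    the number of final clusters stays constant.

    runs: non-empty list of (T, n_clusters) pairs, in any order.
    """
    if not runs:
        raise ValueError("at least one run is required")
    runs = sorted(runs)
    best = None   # (width, central T, n_clusters)
    start = 0     # index where the current plateau begins
    for k in range(1, len(runs) + 1):
        # Close the current plateau at the end of the list or when the count changes.
        if k == len(runs) or runs[k][1] != runs[start][1]:
            t_lo, t_hi = runs[start][0], runs[k - 1][0]
            width = t_hi - t_lo
            if best is None or width > best[0]:
                best = (width, 0.5 * (t_lo + t_hi), runs[start][1])
            start = k
    return best[1], best[2]

# Toy scan over T; these values are illustrative, not results from the paper.
scan = [(1.0, 120), (2.0, 64), (3.0, 31), (4.0, 31),
        (5.0, 31), (6.0, 12), (7.0, 12), (8.0, 3)]
t_opt, n_clusters = plateau_threshold(scan)
```

In this toy scan the cluster count is stable for T between 3.0 and 5.0, so the sketch returns the central value of that interval together with the corresponding number of clusters.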



Table 1. specClass distribution of clusters found by the
NEC algorithm after PPS initialization.

5. Conclusions

As can be seen from Table 1, different groups of clusters are found according to their
specClass composition. There is a significant fraction of clusters dominated
by just one kind of specClass object and contaminated by a few objects flagged
with different values of specClass. Other clusters show a fairly homogeneous
mixture of objects of different spectral types, with a prominence of
star (specClass = 1, 6) - quasar (specClass = 3, 4) and galaxy (specClass = 2) -
unknown (specClass = 0) mixtures. Only a few clusters show comparable proportions of
all objects. A more in-depth analysis of these mixed clusters, and the comparison
between the colours and other photometric and spectroscopic information for the same
objects, will hopefully cast light upon the associations between objects with
different values of specClass and remove the degeneracy in the colour space.